Narrowing the semantic gap - improved text-based web document retrieval using visual features

Authors

  • Rong Zhao
  • William I. Grosky
Abstract

In this paper, we present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique, latent semantic indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI determines clusters of co-occurring keywords, sometimes called concepts, so that a query which uses a particular keyword can then retrieve documents that may not contain this keyword but do contain other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.
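
The idea in the abstract can be illustrated with a minimal sketch, assuming NumPy and a truncated SVD as the LSI step. Everything below is hypothetical and not taken from the paper: the toy feature-by-document matrix, the feature names, and the helper functions `lsi_fit`, `fold_in_query`, and `cosine_scores`. Two made-up histogram bins stand in for the visual features; color anglograms are not modeled.

```python
import numpy as np

# Toy feature-by-document matrix; every value here is made up for illustration.
# Rows 0-3 are keyword counts, rows 4-5 stand in for visual features
# (for example, two bins of a color histogram). Columns are documents 0-3.
FEATURES = ["sunset", "beach", "engine", "car", "hist_orange", "hist_gray"]
A = np.array([
    [1, 0, 0, 0],  # sunset
    [0, 1, 0, 0],  # beach
    [0, 0, 1, 0],  # engine
    [0, 0, 0, 1],  # car
    [1, 1, 0, 0],  # hist_orange
    [0, 0, 2, 2],  # hist_gray
], dtype=float)


def lsi_fit(matrix, k):
    """Return the rank-k factors (U_k, s_k, V_k) of the truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in_query(query_vec, U_k, s_k):
    """Project a feature-space query into the k-dimensional latent space."""
    return (U_k.T @ query_vec) / s_k


def cosine_scores(query_k, docs_k):
    """Cosine similarity between the folded-in query and each document."""
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k)
    return (docs_k @ query_k) / norms


# Keyword-only query: just the term "sunset", no visual features.
query = np.zeros(len(FEATURES))
query[FEATURES.index("sunset")] = 1.0

U_k, s_k, V_k = lsi_fit(A, k=2)
lsi_scores = cosine_scores(fold_in_query(query, U_k, s_k), V_k)

# Baseline: plain cosine similarity in the original feature space.
raw_scores = (A.T @ query) / (np.linalg.norm(A, axis=0) * np.linalg.norm(query))

print("raw:", np.round(raw_scores, 3))
print("lsi:", np.round(lsi_scores, 3))
```

In this toy setup, document 1 does not contain the keyword "sunset", so its raw score is zero. In the rank-2 latent space it scores as highly as document 0, because the two documents share the `hist_orange` feature. This only illustrates the mechanism on invented data and says nothing about the retrieval performance reported in the paper.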

Similar resources

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

Web page classification is achieved using text classification techniques. Web page classification is different from traditional text classification due to additional information, provided by web page structure which provides much information on content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords...

Bridging the Semantic Gap

Content-based image retrieval systems were introduced as an alternative to avoid the need of manual tagging in traditional keyword-based image retrieval systems. However, the representation of image using visual features only involves a loss of information which is referred to as semantic gap. A number of techniques have been proposed to deal with ‘semantic gap’. This paper reviews existing app...

A survey of content-based image retrieval with high-level semantics

In order to improve the retrieval accuracy of content-based image retrieval systems, research focus has been shifted from designing sophisticated low-level feature extraction algorithms to reducing the ‘semantic gap’ between the visual features and the richness of human semantics. This paper attempts to provide a comprehensive survey of the recent technical achievements in high-level semantic-b...

Content Based Image Retrieval of User’s Interest Using Interactive Genetic Algorithm: a Review

At present, digital image libraries and other multimedia databases have expanded rapidly. The semantic gap between visual features and human semantics has therefore become a very important area of research, known as content-based image retrieval (CBIR). If there is a need to retrieve an image from a large image database effectively and precisely, the development of content-based...

Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web

A system is proposed that combines textual and visual statistics in a single index vector for content-based search of a WWW image database. Textual statistics are captured in vector form using latent semantic indexing (LSI) based on text in the containing HTML document. Visual statistics are captured in vector form using color and orientation histograms. By using an integrated approach, it beco...

Journal:
  • IEEE Trans. Multimedia

Volume 4

Publication date: 2002